Pollution

1 My Video

2 Introduction

When considering the purchase of a house, potential buyers also evaluate the broader environment around the property. Air quality plays a significant role in this assessment, particularly the presence of nitric oxide, a prevalent air pollutant. This gas not only harms lung health but also contributes to the formation of acid rain, influencing the desirability of a residential area.

The primary sources of nitric oxide in the environment include the emissions from cars, trucks, and the combustion of fossil fuels by factories. These activities are major contributors to air pollution, directly impacting the quality of life in urban and suburban settings. Consequently, this issue underscores the urgency of advocating for legislative changes. Such reforms could include relocating the development of highways and roads further from residential areas, enhancing renewable technology, and imposing limits on the number of factories within certain regions.

City In Pollution

Moreover, the interplay between air pollution and property values is evident: deteriorating air quality often correlates with lower home prices. Lower property values in turn mean reduced property-tax revenue collected from homeowners. Therefore, addressing air pollution is vital not only for health and environmental reasons but also for the economic stability and attractiveness of residential areas.

2.1 Variables

Data Labeling:

CRIM: per capita crime rate by town (numeric)

ZN: proportion of residential land zoned for lots over 25,000 sq.ft. (numeric)

INDUS: proportion of non-retail business acres per town (numeric)

CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise) (categorical)

NX: nitric oxides concentration (parts per 10 million) (numeric)

RM: average number of rooms per dwelling (numeric)

AGE: proportion of owner-occupied units built prior to 1940 (numeric)

DIS: weighted distances to five Boston employment centres (numeric)

RAD: index of accessibility to radial highways (numeric)

TAX: full-value property-tax rate per $10,000 (numeric)

PTRATIO: pupil-teacher ratio by town (numeric)

B: 1000(Bk - 0.63)^2 where Bk is the proportion of [people of African American descent] by town (numeric)

LSTAT: % lower status of the population (numeric)

MEDV: Median value of owner-occupied homes in $1000s (target variable) (numeric)

(Satre 2024)

2.2 Research Question

Is there a negative relationship between nitric oxide concentration and the median value of owner-occupied homes?

Null Hypothesis: \(H_0: \beta_1 = 0\)

Alternative Hypothesis: \(H_A: \beta_1 < 0\)

2.2.1 Setting the correct working directory (extremely important)

getwd()
## [1] "/Users/damodarpai/Documents/Labratories/Project 2"

3 The Data

3.1 Reading in Data

Boston <- read.csv("Boston Housing Data/Boston.csv")

# Convert column types: CHAS is categorical; all other columns are numeric
numeric_cols <- c("CRIM", "ZN", "INDUS", "NX", "RM", "AGE", "DIS",
                  "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
Boston[numeric_cols] <- lapply(Boston[numeric_cols], as.double)
Boston$CHAS <- as.factor(Boston$CHAS)

head(Boston)

3.2 General summary of the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Boston_NOX_MEDV <- Boston %>% select(NX, MEDV)
head(Boston_NOX_MEDV)
summary(Boston_NOX_MEDV)
##        NX              MEDV      
##  Min.   :0.3850   Min.   : 5.00  
##  1st Qu.:0.4490   1st Qu.:17.02  
##  Median :0.5380   Median :21.20  
##  Mean   :0.5547   Mean   :22.53  
##  3rd Qu.:0.6240   3rd Qu.:25.00  
##  Max.   :0.8710   Max.   :50.00
range1 <- max(Boston_NOX_MEDV$NX) - min(Boston_NOX_MEDV$NX)
range2 <- max(Boston_NOX_MEDV$MEDV) - min(Boston_NOX_MEDV$MEDV)
 
"Slope using the ranges of NX and MEDV:"
## [1] "Slope using the ranges of NX and MEDV:"
range2/range1
## [1] 92.59259
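A quick correlation check gives another rough sense of the direction and strength of the relationship before modeling. The sketch below uses the `Boston` dataset bundled with the MASS package as a stand-in for the CSV read above (an assumption: it mirrors that file, but with lowercase column names `nox` and `medv`).

```r
# Pearson correlation between nitric oxide concentration and median home value,
# using the MASS package's built-in Boston data as a stand-in for the CSV.
library(MASS)

r <- cor(Boston$nox, Boston$medv)
r    # a clearly negative value is consistent with the research question
r^2  # for simple regression, the squared correlation equals the R-squared
```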

3.3 Problem with the Data

As the summary above shows, the data contain several outliers: tracts with high median values despite high nitric oxide concentrations. Many other variables affect house prices, so to limit their influence we remove observations that fall more than 3 IQR beyond the upper or lower quartile. The filtered data should be more representative of the effect of our independent variable.

3.4 Finding a correlation in the data

library(ggplot2)
g = ggplot(Boston, aes(x = NX, y = MEDV)) + geom_point()
g = g + geom_smooth(method = "loess")
g
## `geom_smooth()` using formula = 'y ~ x'

3.4.1 Removing the outliers

Q1 <- quantile(Boston_NOX_MEDV$MEDV, 0.25)
Q3 <- quantile(Boston_NOX_MEDV$MEDV, 0.75)  
IQR <- Q3 - Q1

# Define outlier thresholds
lower_bound <- Q1 - 3 * IQR
upper_bound <- Q3 + 3 * IQR

non_outliers <- Boston_NOX_MEDV$MEDV > lower_bound & Boston_NOX_MEDV$MEDV < upper_bound 

Boston_NOX_MEDV <- Boston_NOX_MEDV[non_outliers, ] 

We account for outliers by flagging values that lie more than 3 interquartile ranges beyond the quartiles; flagged values are removed from the dataset. These observations are likely driven more by other variables, such as the house being exceptionally nice, being close to a great school, or being very close to a place of work, all of which could overshadow the air pollution effect we are testing for. Note that this does not remove every outlier-like value; some observations may still skew the data. But by the standard we use in Element of Statistics, 3 * IQR is the cutoff for outliers.

3.5 Preliminary Graph Descriptors

library(s20x)
pairs20x(Boston_NOX_MEDV) 


3.6 Why use a Simple Regression Model?

We want to model the relationship between our Y (median value of Boston homes) and our X (nitric oxide level in the area surrounding those homes). Since we are not yet sure a relationship exists between the two variables, we fit a simple linear regression to assess whether one does.

3.7 Theoretical Basis of SLR

We will use the equation \[ y = \beta_0 + \beta_1 x_i + \varepsilon_i \] where \(\beta_0\) and \(\beta_1\) are the unknown parameters we estimate using regression.
This equation models the mean of y, which is the mean of MEDV. Taking expectations, and using the fact that the expected value of the error is 0, we get the following.

\[\begin{align*} E(y) &= E(\beta_0 + \beta_1 x_i + \varepsilon_i) \\ &= \beta_0 + \beta_1 x_i + E(\varepsilon_i) \\ &= \beta_0 + \beta_1 x_i \end{align*}\]
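A short companion derivation (an extension of the step above, using the standard SLR assumption that \(x_i\) is treated as fixed) shows that the variance of y comes entirely from the error term:

\[\begin{align*} V(y) &= V(\beta_0 + \beta_1 x_i + \varepsilon_i) \\ &= V(\varepsilon_i) \\ &= \sigma^2 \end{align*}\]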

3.8 Assumptions of SLR

  • Data are independently and identically distributed
  • The probability distribution of the error is normal, centered at 0, with constant variance.

The data plausibly satisfy the first requirement: the prices of these houses do not directly affect one another, and the observations are a sample across the greater Boston area. We verify the second requirement later in the project, after taking the log of MEDV and checking that the distribution looks normal; for now we assume both conditions hold for future SLR.

3.9 Method of Least Squares

To estimate my \(\beta_0\) and \(\beta_1\), I will use the method of least squares via the lm function. The method chooses the line that minimizes the sum of the squared errors (residuals).
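The minimization has a closed-form solution for simple regression, and it can be checked by hand. The sketch below computes the least-squares estimates directly and compares them with lm(), using the MASS package's Boston data as a stand-in for the filtered dataset above (lowercase column names `nox` and `medv` are an assumption of this sketch).

```r
# Least-squares estimates computed from the closed-form formulas,
# verified against lm(), using MASS's built-in Boston data.
library(MASS)

x <- Boston$nox
y <- Boston$medv

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope: Sxy/Sxx
b0 <- mean(y) - b1 * mean(x)                                     # intercept

fit <- lm(medv ~ nox, data = Boston)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE: both methods agree
```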

bos.lm = lm(MEDV~NX, data = Boston_NOX_MEDV) 
summary(bos.lm) 
## 
## Call:
## lm(formula = MEDV ~ NX, data = Boston_NOX_MEDV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.703  -4.266  -1.472   3.362  30.441 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   41.231      1.472   28.02   <2e-16 ***
## NX           -35.350      2.598  -13.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.704 on 488 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.2736 
## F-statistic: 185.1 on 1 and 488 DF,  p-value: < 2.2e-16

From our function, we get the values:

\[ \hat{\beta}_0 = 41.231 \] \[ \hat{\beta}_1 = -35.350 \]

3.10 Confidence Interval for the slope of MEDV/NX

ciReg(bos.lm, conf.level=0.95, print.out=TRUE)
##             95 % C.I.lower    95 % C.I.upper
## (Intercept)       38.33953          44.12233
## NX               -40.45504         -30.24586

As we can see from the confidence interval, our \(\beta_1\) lies between roughly -40 and -30, which is a hint that our null hypothesis is not true, because 0 is not included within the interval.

3.11 Visualizing our data

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
     ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
     main="Scatter Plot of Median Value of Homes and Nitric Oxide Concentration", data=Boston_NOX_MEDV)
abline(bos.lm)

3.12 Plotting the Residual Sum of Squares, Model Sum of Squares, and Total Sum of Squares

Plotting the RSS, MSS, and TSS gives us a visual understanding of the differences between our estimated line, the actual data, and the mean.

yhat = with(Boston_NOX_MEDV,predict(bos.lm,data.frame(NX)))

3.12.1 Plotting RSS

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
              ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
              main="Residual Line Segments of Median Value of Homes and Nitric Oxide Concentration", data= Boston_NOX_MEDV)
abline(bos.lm)
with(Boston_NOX_MEDV,{segments(NX,MEDV,NX,yhat)})
abline(bos.lm)

3.12.2 Plotting MSS

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
             ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
             main="Model Deviation Line Segments of Median Value of Homes and Nitric Oxide Concentration", data= Boston_NOX_MEDV)
abline(bos.lm)
with(Boston_NOX_MEDV, abline(h=mean(MEDV)))
with(Boston_NOX_MEDV, segments(NX,mean(MEDV),NX,yhat,col="Red"))
abline(bos.lm)

3.12.3 Plotting TSS

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
              ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
              main="Total Deviation Line Segments of Median Value of Homes and Nitric Oxide Concentration", data=Boston_NOX_MEDV) 
with(Boston_NOX_MEDV,abline(h=mean(MEDV)))
with(Boston_NOX_MEDV, segments(NX,MEDV,NX,mean(MEDV),col="Green"))

3.12.4 Calculating RSS, MSS, and TSS

RSS=with(Boston_NOX_MEDV,sum((MEDV-yhat)^2))
RSS
## [1] 21930.49
MSS=with(Boston_NOX_MEDV,sum((yhat-mean(MEDV))^2))
MSS
## [1] 8320.5
TSS=with(Boston_NOX_MEDV,sum((MEDV-mean(MEDV))^2))
TSS
## [1] 30250.99
MSS/TSS
## [1] 0.2750489

Here we can use our RSS, MSS, and TSS calculations to obtain the R-squared (MSS/TSS = 0.275, matching the model summary), which measures how well our model fits the data.
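The identity behind that calculation, R-squared = MSS/TSS = 1 - RSS/TSS, can be verified against summary(). The sketch below repeats the decomposition on the MASS package's Boston data as a stand-in for the filtered dataset (lowercase `nox` and `medv` are an assumption of this sketch).

```r
# Verifying the sum-of-squares decomposition and the R-squared identity
# against summary(), using MASS's built-in Boston data.
library(MASS)

fit  <- lm(medv ~ nox, data = Boston)
yhat <- fitted(fit)
y    <- Boston$medv

RSS <- sum((y - yhat)^2)        # residual sum of squares
MSS <- sum((yhat - mean(y))^2)  # model sum of squares
TSS <- sum((y - mean(y))^2)     # total sum of squares

all.equal(MSS / TSS, summary(fit)$r.squared)  # TRUE
all.equal(RSS + MSS, TSS)                     # TRUE: the decomposition holds
```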

3.13 Checking for Normality

normcheck(bos.lm, shapiro.wilk = TRUE) 

3.14 Lowess Smoother scatter plot of MEDV vs NX

library(s20x)
trendscatter(MEDV~NX, f = 0.5, data = Boston_NOX_MEDV, main="MEDV vs NX")

3.15 Problem with Normality

As we can see from our normcheck, the data do not look normal at all; in fact, they are clearly right skewed. To correct this, we can take the log of our MEDV data and re-estimate the model from there.

4 Integrating the log of MEDV into our linear model

4.1 Creating log data

Boston_NOX_MEDV_LOG = Boston_NOX_MEDV 
Boston_NOX_MEDV_LOG$MEDV = log(Boston_NOX_MEDV$MEDV)   

4.2 General Summary of the Data

library(s20x)
pairs20x(Boston_NOX_MEDV_LOG) 

4.3 Summary of the linear model

boslog.lm = lm(MEDV~NX, data = Boston_NOX_MEDV_LOG) 
summary(boslog.lm) 
## 
## Call:
## lm(formula = MEDV ~ NX, data = Boston_NOX_MEDV_LOG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13843 -0.17137 -0.01548  0.18474  1.05430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.03697    0.06929   58.27   <2e-16 ***
## NX          -1.86019    0.12232  -15.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3156 on 488 degrees of freedom
## Multiple R-squared:  0.3215, Adjusted R-squared:  0.3201 
## F-statistic: 231.3 on 1 and 488 DF,  p-value: < 2.2e-16

4.4 Confidence Interval

ciReg(boslog.lm, conf.level=0.95, print.out=TRUE)
##             95 % C.I.lower    95 % C.I.upper
## (Intercept)        3.90084           4.17311
## NX                -2.10052          -1.61985

4.5 Visualizing the linear correlation

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
     ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
     main="Scatter Plot of Median Value of Homes and Nitric Oxide Concentration", data=Boston_NOX_MEDV_LOG)
abline(boslog.lm)

4.6 Residual Data and Confirming Assumptions

yhatlog = with(Boston_NOX_MEDV_LOG,predict(boslog.lm,data.frame(NX)))

4.6.1 RSS of log data

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
              ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
              main="Residual Line Segments of Median Value of Homes and Nitric Oxide Concentration", data= Boston_NOX_MEDV_LOG)
abline(boslog.lm)
with(Boston_NOX_MEDV_LOG,segments(NX,MEDV,NX,yhatlog))
abline(boslog.lm)

4.6.2 MSS of log data

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
             ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
             main="Model Deviation Line Segments of Median Value of Homes and Nitric Oxide Concentration", data= Boston_NOX_MEDV_LOG)
abline(boslog.lm)
with(Boston_NOX_MEDV_LOG, abline(h=mean(MEDV)))
with(Boston_NOX_MEDV_LOG, segments(NX,mean(MEDV),NX,yhatlog,col="Red"))
abline(boslog.lm) 

4.6.3 TSS of log data

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
              ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),
              main="Total Deviation Line Segments of Median Value of Homes and Nitric Oxide Concentration", data=Boston_NOX_MEDV_LOG)
with(Boston_NOX_MEDV_LOG,abline(h=mean(MEDV)))
with(Boston_NOX_MEDV_LOG, segments(NX,MEDV,NX,mean(MEDV),col="Green"))

4.7 Comparing Normality

normcheck(boslog.lm) 

normcheck(bos.lm)

The log data is clearly more normal than the raw MEDV; we can tell from the histogram. Though the Shapiro-Wilk p-value is still essentially 0, the log model is preferable because it comes closer to satisfying our normality assumption.

library(s20x)
trendscatter(MEDV~NX, f = 0.5, data = Boston_NOX_MEDV_LOG, main="MEDV vs NX")

4.8 Checking Cook’s distance to find further bias from outliers

If any points have a large Cook’s distance, they will stand out in the graph. Note that these are not all of the outliers, just some of the more influential ones affecting the fit. In a perfect world we could remove some of these on top of the ones we removed at the beginning.

cooks20x(boslog.lm)

Cook’s distance for the linear model with log(MEDV) shows influential outliers at observations 151, 160, and 383. In a perfect world, I would remove these values since they are affecting the model.
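The same idea can be checked numerically with base R's cooks.distance() rather than by reading the plot. The sketch below flags points above the common 4/n rule of thumb; both the threshold choice and the use of MASS's Boston data (lowercase `nox`, `medv`) are assumptions of this sketch, not part of the original analysis.

```r
# Flagging influential observations numerically with cooks.distance(),
# using MASS's built-in Boston data and the 4/n rule of thumb.
library(MASS)

fit <- lm(log(medv) ~ nox, data = Boston)
d   <- cooks.distance(fit)          # one distance per observation
influential <- which(d > 4 / length(d))
head(influential)                   # row indices of the most influential points
```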

5 Checking a Different Model (Quadratic)

quad.lm=lm(MEDV~NX + I(NX^2),data=Boston_NOX_MEDV_LOG)

plot(MEDV~NX,bg="Blue",pch=21,cex=1.2,
   ylim=c(0,1.1*max(MEDV)),xlim=c(0,1.1*max(NX)),main="Scatter Plot and Quadratic Fit of MEDV vs NX",data = Boston_NOX_MEDV_LOG)
myplot = function(x){quad.lm$coef[1] + quad.lm$coef[2]*x + quad.lm$coef[3]*x^2}

curve(myplot, lwd = 2, add = TRUE)

The following checks show how the quadratic model compares with the linear models.

5.1 Checking for Normality

normcheck(quad.lm, shapiro.wilk = TRUE)

5.2 Summary of the data

summary(quad.lm) 
## 
## Call:
## lm(formula = MEDV ~ NX + I(NX^2), data = Boston_NOX_MEDV_LOG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11520 -0.17849 -0.00146  0.17263  1.10117 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.2602     0.3103  16.951  < 2e-16 ***
## NX           -6.1358     1.0650  -5.761 1.48e-08 ***
## I(NX^2)       3.5744     0.8846   4.041 6.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3108 on 487 degrees of freedom
## Multiple R-squared:  0.3435, Adjusted R-squared:  0.3409 
## F-statistic: 127.4 on 2 and 487 DF,  p-value: < 2.2e-16

5.3 General plot of the data

plot(quad.lm, which = 1)

5.4 Confidence Interval of the data

ciReg(quad.lm, conf.level=0.95, print.out=TRUE)
##             95 % C.I.lower    95 % C.I.upper
## (Intercept)        4.65045           5.86989
## NX                -8.22838          -4.04329
## I(NX^2)            1.83628           5.31254

5.5 Comparison of the Two Models

summary(quad.lm) 
## 
## Call:
## lm(formula = MEDV ~ NX + I(NX^2), data = Boston_NOX_MEDV_LOG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11520 -0.17849 -0.00146  0.17263  1.10117 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.2602     0.3103  16.951  < 2e-16 ***
## NX           -6.1358     1.0650  -5.761 1.48e-08 ***
## I(NX^2)       3.5744     0.8846   4.041 6.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3108 on 487 degrees of freedom
## Multiple R-squared:  0.3435, Adjusted R-squared:  0.3409 
## F-statistic: 127.4 on 2 and 487 DF,  p-value: < 2.2e-16
summary(boslog.lm)
## 
## Call:
## lm(formula = MEDV ~ NX, data = Boston_NOX_MEDV_LOG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13843 -0.17137 -0.01548  0.18474  1.05430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.03697    0.06929   58.27   <2e-16 ***
## NX          -1.86019    0.12232  -15.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3156 on 488 degrees of freedom
## Multiple R-squared:  0.3215, Adjusted R-squared:  0.3201 
## F-statistic: 231.3 on 1 and 488 DF,  p-value: < 2.2e-16

5.5.1 Contrast of the R-squared-values

Comparing the two R-squared values, the quadratic model has a multiple R-squared of 0.3435, versus 0.3215 for the linear model on log(MEDV). Since R-squared measures how well a model fits a particular dataset, the quadratic model fits these data slightly better. That said, the improvement is only about 0.02, so there is no practically significant difference between the linear and quadratic models.
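Instead of eyeballing the two R-squared values, the nested models can be compared formally with an F-test via base R's anova(). The sketch below uses MASS's Boston data (lowercase `nox`, `medv`) as a stand-in for the filtered dataset above.

```r
# Formal comparison of the linear and quadratic models with a nested-model
# F-test, using MASS's built-in Boston data.
library(MASS)

lin  <- lm(log(medv) ~ nox, data = Boston)
quad <- lm(log(medv) ~ nox + I(nox^2), data = Boston)

anova(lin, quad)  # a small p-value means the quadratic term adds explanatory power
```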

6 Results and Conclusion

Before running the SLR test, we made sure that our data met the assumptions. From our linear regression model summary, and more importantly from the confidence interval for \(\beta_1\), we can tell that there is a statistical relationship between MEDV and NX. We know this because 0 is not within the 95 percent confidence interval, meaning that if we repeated the sampling many times, 95 percent of such intervals would contain the true \(\beta_1\), and ours excludes 0. Though our R-squared values are low, several extraneous variables affect the data, and the remaining outliers inflate the residual variance and depress the R-squared. To conclude, we reject our null hypothesis and find evidence of a negative relationship between MEDV and NX.

6.1 Moving Forward

If I were to redo this analysis, I would remove more outliers. I would also try additional models, since the R-squared stayed low throughout the entire analysis.
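One possible next step along those lines is a multiple regression that adds other predictors from the variable list in section 2.1. This is only a hedged sketch: it uses MASS's Boston data (lowercase names) as a stand-in, and the particular predictors chosen (rm, lstat) are illustrative, not part of the original analysis.

```r
# Illustrative multiple regression adding room count and lower-status
# percentage alongside nitric oxide, using MASS's built-in Boston data.
library(MASS)

fit_multi <- lm(medv ~ nox + rm + lstat, data = Boston)
summary(fit_multi)$r.squared  # typically well above the simple model's R-squared
```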

7 References

Mendenhall, William, and Terry Sincich. Statistics for Engineering and the Sciences. Prentice Hall.

Mullane, Joseph. “Ulez Charge on Homes Would Tackle Air Pollution and Rising NO2, Say Researchers.” Homebuilding & Renovating, Homebuilding, 22 Aug. 2023, www.homebuilding.co.uk/news/ULEZ-charge-on-homes.

Miłuch, Oktawia, and Katarzyna Kopczewska. “Fresh air in the city: The impact of air pollution on the pricing of Real Estate.” Environmental Science and Pollution Research, vol. 31, no. 5, 2 Jan. 2024, pp. 7604–7627, https://doi.org/10.1007/s11356-023-31668-1.